METHOD FOR ALLOCATING TRANSMISSION RESOURCES USING REINFORCEMENT LEARNING
Patent abstract:
A method of scheduling packets belonging to a plurality of data flow categories in a multi-access telecommunication system sharing a plurality of transmission resources. The method includes, at each transmission time interval, the selection, by a scheduler (CT, CF), of a resource allocation plan and an allocation of the transmission resources to the data streams in accordance with the selected resource allocation plan. This selection is performed by querying a correspondence table (LUT) whose content results from the implementation of reinforcement learning and which makes it possible to identify, from the current state of the multi-access telecommunications system (s[t]), the resource allocation plan to be selected, this plan being optimal to satisfy heterogeneous needs in terms of quality of service.
Publication number: FR3072851A1
Application number: FR1759982
Filing date: 2017-10-23
Publication date: 2019-04-26
Inventors: Ioan-Sorin COMSA; Antonio DE DOMENICO; Dimitri Ktenas
Applicants: Commissariat à l'Energie Atomique (CEA); Commissariat à l'Energie Atomique et aux Energies Alternatives (CEA)
IPC main class:
Patent description:
METHOD FOR ALLOCATING TRANSMISSION RESOURCES USING REINFORCEMENT LEARNING

DESCRIPTION

TECHNICAL FIELD

The field of the invention is that of multi-access telecommunications systems offering a plurality of transmission resources. The invention relates to the management of transmission resources and relates more particularly to a scheduling implemented in two decoupled stages: first a scheduling in the time domain, which selects a group of data streams on the basis of needs in terms of quality of service (typically the more urgent data flows), then a scheduling in the frequency domain, which takes into account the state of the transmission channel to allocate radio resources to the data flows of the selected group.

PRIOR STATE OF THE ART

5th generation mobile networks will be characterized by the presence of multiple services whose needs are highly heterogeneous. For example, Extreme Mobile Broadband services require very high data rates, Ultra-Reliable Low Latency applications require very low latencies and very low error rates, and Massive Machine-Type Communication services will impose the presence of a very high density of user connections. These different needs are often contradictory and their joint optimization is a difficult problem to solve.

The OFDMA (Orthogonal Frequency Division Multiple Access) access technique allows fine-grained radio resource allocation where each resource unit RB (Resource Block) can be allocated and modulated adaptively to exploit the frequency diversity of each user. To meet the diverse needs in terms of quality of service (QoS) of the different services of a 5th generation network, flexible radio resource management (RRM) is needed. Radio resource management uses a packet scheduler which is responsible for sharing the radio resource between the different active users at each transmission time interval. This scheduler has a set of scheduling rules used to control parameters such as transmission power, coding and modulation scheme and allocated bandwidth. The objective is to use radio resources as efficiently as possible to increase spectral efficiency, while guaranteeing users the level of service required by each traffic request.

Four types of scheduling rules can be distinguished:
- "Channel unknown / QoS unknown" rules. These rules are characterized by a very simple implementation requiring neither knowledge of a channel quality indicator (CQI, Channel Quality Indicator) nor QoS measurements. For example, a Round-Robin type scheduler allocates identical bandwidth to each active user;
- "Channel unknown / QoS known" rules. These rules aim to satisfy QoS needs without taking CQI information into account. For example, the EDF rule (Earliest Deadline First) assigns a higher priority to a packet as its sending deadline approaches, without any consideration of the state of the channel;
- "Channel known / QoS unknown" rules. These rules provide a good compromise between performance and complexity by taking into account the diversity of users. However, their performance remains subject to the reliability of CQI information. One can notably cite the MCI rule (Maximum Carrier to Interference), which aims to maximize the instantaneous total throughput;
- "Channel known / QoS known" rules. These rules take into account both QoS and CQI information.
One can notably cite the rules PF-BF (Proportional Fair - Barrier Function), PF-EXP (Proportional Fair - EXPonential) and PF-OPLF (Proportional Fair - Opportunistic Packet Loss Fair), which respectively endeavor to guarantee a throughput, to minimize delays and to reduce the packet loss rate.

None of these rules can be considered the most appropriate solution for all network conditions and all QoS needs. To meet time constraints, a scheduling in two decoupled stages has been proposed: first a scheduling in the time domain to select a group of packets on the basis of needs in terms of quality of service, typically the more urgent packets, then a scheduling in the frequency domain taking into account the state of the transmission channel to allocate the radio resources to the packets of the selected group. Such scheduling is for example known from patents EP 2 148 478 B1 and EP 2 936 908 B1. However, these scheduling solutions do not make it possible to differentiate multiple categories of traffic having different needs in terms of QoS.

STATEMENT OF THE INVENTION

The object of the invention is to propose a method for scheduling packets belonging to a plurality of packet categories in a multi-access telecommunications system sharing a plurality of transmission resources, capable of processing heterogeneous data flows while offering optimal performance in terms of satisfying QoS constraints. To this end, the invention proposes a method which comprises, for a state of the multi-access telecommunications system, the determination, by iterations of reinforcement learning, of a resource allocation plan which maximizes a sum of rewards. Each iteration of reinforcement learning includes:
- the allocation of transmission resources to data flows according to a resource allocation plan;
- the transmission of packets by means of the allocated transmission resources;
- for each of the categories of data flows, the calculation of at least one transmission performance indicator for each of the data flows of the category, and the comparison, for each of the data flows of the category, of the at least one calculated transmission performance indicator with a threshold representative of a quality of service constraint relating to the at least one transmission performance indicator for the category, and
- the determination of a reward according to the result of said comparison for each of the data flows of each of the categories.

Some preferred but non-limiting aspects of this method are as follows:
- the resource allocation plan comprises a hierarchy (prioritization) rule and at least a first and a second scheduling rule, the allocation of transmission resources comprising two steps consisting of:
  o a scheduling in the time domain performed by a time classifier configured to prioritize the categories of data flows, according to the hierarchy rule, into at least one priority category and one secondary category; and
  o a scheduling in the frequency domain carried out by a frequency classifier configured to:
    schedule the data streams of the priority category according to the first scheduling rule and allocate transmission resources to the data streams of the priority category thus scheduled; and
    in the case of remaining resources, schedule the data streams of the secondary category according to the second scheduling rule and allocate remaining transmission resources to the data streams of the secondary category thus scheduled.
- each iteration of reinforcement learning includes:
  o according to a probability P_a, an exploitation selection consisting in selecting the resource allocation plan which maximizes the sum of the rewards at this stage of the iterations; and
  o according to a probability P_a*, an exploration selection consisting in selecting a resource allocation plan different from the resource allocation plan which maximizes the sum of the rewards at this stage of the iterations;
- the at least one transmission performance indicator is one of a transmission delay, a transmission rate and a transmission loss rate.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects, aims, advantages and characteristics of the invention will appear better on reading the following detailed description of preferred embodiments thereof, given by way of non-limiting example, and made with reference to the accompanying drawings, on which:
- Figure 1 is a diagram illustrating a packet scheduling method exploiting a look-up table (LUT) whose content results from the implementation of reinforcement learning;
- Figure 2 is a diagram illustrating a possible embodiment of reinforcement learning implementing an exploitation/exploration compromise.

DETAILED PRESENTATION OF PARTICULAR EMBODIMENTS

The invention takes place in the context of a multi-access telecommunications system where a plurality of users and/or applications have access to common transmission resources, and in which, for each access, the data are transmitted in the form of packets. A typical example of use of the invention is that of an OFDMA system in which the available bandwidth is divided into N_RB resource units RB (Resource Block), where a resource unit RB represents the minimum quantity of frequency resources that a base station can allocate at each transmission time interval (TTI).

In the context of the invention, the telecommunications system offers its transmission resources to P categories of data flows (in what follows, a data flow is associated with a user). We thus note T_c = {1, ..., P} the set of data flow categories. Each category p ∈ T_c is characterized by a set of active users UE_p = {1, ..., I_p} (i.e. a set of data flows) and a set of objectives and specific requirements in terms of QoS, noted O_p = {1, ..., N_Op}, this set O_p comprising for example a guaranteed bit rate (GBR), a delay and a packet loss rate (PLR).

In order to efficiently schedule categories of heterogeneous data streams having different priorities, the invention uses a scheduler configured to, at each transmission time interval, select a resource allocation plan and perform an allocation of the N_RB transmission resources to the data flows in accordance with the selected resource allocation plan. In a preferred embodiment represented in Figure 1, the scheduler is configured to carry out a scheduling in two decoupled steps, with first a scheduling step in the time domain TD and then a scheduling step in the frequency domain FD. As will be detailed later, the resource allocation plan in this case includes a hierarchy rule and at least a first and a second scheduling rule. At a transmission interval TTI t, the telecommunications system has a current state s[t] and each category of data stream p ∈ T_c has a current state s_p[t].
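By way of illustration, a minimal sketch of these notions (categories, flows, QoS objectives) is given below; the class names, field names and numeric values are illustrative assumptions and are not taken from the patent.

```python
# Illustrative system model: P categories of data flows, each with QoS targets
# (the thresholds written x̄_{p,o} above) and a set of active flows (users).
from dataclasses import dataclass, field
from typing import Dict, List

N_RB = 50   # number of resource units RB per TTI (assumed value)

@dataclass
class Flow:
    """One active data flow (user) i_p of a category p."""
    backlog_bits: int                                     # data waiting to be sent
    cqi: int                                              # channel quality indicator
    kpi: Dict[str, float] = field(default_factory=dict)   # measured KPIs x_{p,i_p,o}

@dataclass
class Category:
    """One data flow category p of T_c, with its QoS objectives O_p."""
    name: str
    qos_targets: Dict[str, float]                         # thresholds x̄_{p,o}
    flows: List[Flow] = field(default_factory=list)

categories = [
    Category("URLLC", {"delay": 0.1, "plr": 0.01}),
    Category("eMBB",  {"rate": 0.8}),
    Category("mMTC",  {"plr": 0.05}),
]
```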
The scheduling in the time domain is carried out by a time classifier CT configured to perform the selection of a hierarchy rule and the prioritization of the data flow categories, according to the selected hierarchy rule, into at least one priority category and one secondary category. The time classifier CT thus prioritizes the categories of data flows according to their current states and forwards them iteratively to a frequency classifier CF, from the highest priority to the lowest, to define the scheduling in the frequency domain. One possible prioritization strategy consists in ordering the data flows according to the priority level which corresponds to each flow, as defined for example in table 6.1.7 of the 3GPP Technical Specification Group Services and System Aspects, "TS 23.203, Policy and charging control architecture" (Release 14), V14.4.0, June 2017. Another possible strategy is the prioritization of data flows giving priority to GBR flows (flows with a guaranteed bit rate, GBR designating "Guaranteed Bit Rate") over non-GBR flows (flows with no guaranteed bit rate). In the example in Figure 1, it is category p which is selected as the priority category.

The scheduling in the frequency domain is carried out by the frequency classifier CF, which uses as input the state of the users of the selected category p. The frequency classifier is configured to perform:
o the selection of a first scheduling rule, the scheduling of the data streams of the priority category according to the first selected scheduling rule, and an allocation of resources to the data streams of the priority category ordered in accordance with this first scheduling; and
o in the case of remaining resources, the selection of a second scheduling rule, the scheduling of the data streams of the secondary category according to the second selected scheduling rule, and an allocation of the remaining resources to the data streams of the secondary category ordered in accordance with this second scheduling.

The frequency classifier CF thus selects the most appropriate scheduling rule u_p from N_R possible rules (u_p ∈ U = {0, 1, ..., N_R}) to improve the QoS of the data flows of the priority category p. Here u_k = 0 indicates that category k is not scheduled, due to the absence of remaining resources. Examples of scheduling rules have been given in the introductory part. The application of the most appropriate scheduling rule u_p provides, for each user i_p ∈ UE_p of the selected category p, its priority Ψ_{u_p,i_p} in the resource allocation scheme. The resource allocation AR is performed iteratively by allocating an available resource unit RB to the user of highest priority, until there are no more resources available (i.e. all the RB resource units have been allocated) or until all the users i_p ∈ UE_p of the priority category p have been served (i.e. there is no longer any data to be transmitted for this category of data flows). In the latter case, if there are remaining resources, the time classifier CT passes the highest-level secondary category of data flows to the frequency classifier CF, and so on until all the RB resource units have been allocated or all the users of all the categories have been served. We denote J = {J_1[t], ..., J_P[t]} the set which describes the number of RB resource units allocated at each TTI t to each of the categories p ∈ T_c.
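As an illustration, here is a minimal sketch of this decoupled two-stage allocation, assuming the Category/Flow structures of the previous sketch; `hierarchy_rule`, `pick_scheduling_rule` and `bits_per_rb` are hypothetical helpers standing in for the hierarchy rule a_CT, the per-category scheduling rule u_p and link adaptation.

```python
def bits_per_rb(flow):
    """Crude link adaptation: more bits per RB at higher CQI (illustrative only)."""
    return 100 * max(1, flow.cqi)

def schedule_tti(categories, hierarchy_rule, pick_scheduling_rule, n_rb):
    """One TTI: the time classifier orders the categories, then the frequency
    classifier ranks the flows of each category and assigns RBs one by one."""
    allocation = {}                                   # flow id -> number of RBs (the set J)
    remaining = n_rb
    for cat in hierarchy_rule(categories):            # a_CT: hierarchy of the categories
        if remaining == 0:
            break
        rule = pick_scheduling_rule(cat)              # u_p: rule chosen for this category
        ranked = sorted(cat.flows, key=rule, reverse=True)   # priorities Ψ_{u_p,i_p}
        for flow in ranked:
            while remaining > 0 and flow.backlog_bits > 0:
                allocation[id(flow)] = allocation.get(id(flow), 0) + 1
                flow.backlog_bits = max(0, flow.backlog_bits - bits_per_rb(flow))
                remaining -= 1
    return allocation
```

Under these assumptions, an MCI-like scheduling rule could simply be `lambda flow: flow.cqi`, while an EDF-like rule would rank flows by the time remaining before their packet deadline, and `hierarchy_rule` would return the categories sorted by the chosen prioritization strategy.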
In the context of the invention, the scheduler is configured to interrogate a correspondence table LUT in order to identify, from the current state s[t] of the multi-access telecommunications system, the resource allocation plan to implement. In the context of the preferred embodiment, one and/or the other of the time classifier CT and the frequency classifier CF are configured to carry out, at each transmission time interval TTI, the interrogation of the correspondence table LUT to identify, from the current state s[t] of the multi-access telecommunications system, respectively the hierarchy rule and/or the first scheduling rule u_p (and, where applicable, the scheduling rule or rules applicable to the one or more secondary categories subject to resource allocation) to be selected.

The invention relates, according to a first aspect, to the creation of the correspondence table. It relates more particularly to a method for determining an allocation of resources to packets belonging to a plurality of categories of data flows in a multi-access telecommunications system sharing a plurality of transmission resources, this method comprising, for a state of the multi-access telecommunications system, the determination, by iterations of reinforcement learning, of a resource allocation which maximizes a sum of rewards. Each iteration of reinforcement learning includes:
- the allocation of transmission resources to data flows according to a resource allocation plan;
- the transmission of packets by means of the allocated transmission resources;
- for each of the categories of data flows, the calculation of at least one transmission performance indicator (KPI, Key Performance Indicator) for each of the data flows of the category, and the comparison, for each of the data flows of the category, of the at least one calculated transmission performance indicator (noted x_{p,i_p,o}) with a threshold (noted x̄_{p,o}) representative of a quality of service constraint relating to the at least one transmission performance indicator for the category, and
- the determination of a reward according to the result of said comparison for each of the data flows of each of the categories.

Thus, to allow the QoS constraints of heterogeneous categories of data flows to be satisfied, the invention uses reinforcement learning to learn the optimal resource allocation plan to be implemented by the scheduler. In the context of the preferred embodiment, reinforcement learning makes it possible to determine the hierarchy rule and/or the optimal scheduling rules to be implemented respectively by the time classifier, which dynamically prioritizes the different categories of data flows, and by the frequency classifier, which selects the most appropriate scheduling rule to allocate transmission resources to the activated data flow category (i.e. first the priority category, then the secondary categories according to the hierarchy established by the time classifier).

In one embodiment, the two time and frequency classifiers use a correspondence table resulting from reinforcement learning. In another embodiment, only the frequency classifier uses a correspondence table resulting from reinforcement learning. This learning makes it possible to learn the most appropriate strategy for allocating transmission resources to the different categories of data flows and thus to determine the scheduling rule to be applied to a given category to optimize QoS.
In such a case, the time classifier can be relatively simple, and for example implement a Round-Robin type strategy or a strategy based on fixed priorities for the categories of data flows (so that a given category is only processed by the frequency classifier when the higher-priority categories have no data to transmit). In yet another embodiment, only the time classifier uses a correspondence table resulting from reinforcement learning. This learning makes it possible, for a current state s[t] of the system, to determine the most appropriate rule for prioritizing the categories of data flows. In such a case, the frequency classifier can be relatively simple, and for example implement the MCI strategy or the EDF strategy.

The following describes the implementation of reinforcement learning in the context of the preferred embodiment where the two classifiers use a correspondence table resulting from such learning. In general, reinforcement learning aims to learn from experience what to do in different situations, so as to optimize a quantitative reward over time. At each instant t, a controller observes the complete state of the system s and performs a given action a. At time t+1, the controller observes a new state and receives a reward r which evaluates the optimality of the previous action. The controller goes through a set of transitions (i.e. from one state to another) and explores all possible state-action pairs until it determines the decisional behaviour (called strategy or policy, a function associating with the current state the action to perform) that is optimal in the sense that it maximizes the sum of rewards over time.

Let S be the space of system states, and s = [s_1, ..., s_P] ∈ S the current state of the system, which describes the state of each of the data flow categories. The state s_p of category p is defined by the state of each of its users. The elements which describe the state of each category can be separated into uncontrollable elements and controllable elements, depending on whether or not they depend on the action performed by the controller. We thus note s_p = (s_p^c, s_p^u), ∀p ∈ T_c, with s_p^c a controllable state and s_p^u an uncontrollable state. The uncontrollable state s_p^u includes the number of active users per category, the CQI indicators of the users and their incoming data rate. The controllable state is characterized by the instantaneous key performance indicators (KPI) of the users and their deviations from the QoS constraints. The N_Op instantaneous KPI indicators (for example one and/or the other among a transmission delay, a transmission rate and a transmission loss rate) of a user i_p of category p can be described by the vector x_{p,i_p}[t] = [x_{p,i_p,1}[t], ..., x_{p,i_p,N_Op}[t]]. The QoS constraints of category p with respect to these KPI indicators can be described by the vector x̄_p = [x̄_{p,1}, ..., x̄_{p,N_Op}], with x̄_{p,o} a threshold representative of a quality of service constraint on the transmission performance indicator o for category p. Thus, for each user, the controllable state is written s_{p,i_p}[t] = x̄_p − x_{p,i_p}[t].

We denote A = A_CT × A_CF the set of possible actions of the reinforcement learning controller, with A_CT the set of actions corresponding to the time classifier and A_CF the set of actions corresponding to the frequency classifier. More specifically, A_CT = {a_CT,1, ..., a_CT,N} is a set of size N, where each action corresponds to a possible hierarchy of the P categories of data flows.
A_CF = {1, ..., N_R} is a set of scheduling rules of size N_R, where the action i ∈ A_CF corresponds to the selection of the i-th scheduling rule. Thus a = {a_CT, a_CF} ∈ A is the pair of actions made up of the actions performed by each of the classifiers.

The reward function R evaluates the optimality of applying the action a from a state s. It takes into account, for each of the data flows of each of the categories and for each performance indicator, the difference between the instantaneous indicator of the data flow and the threshold representative of a quality of service constraint relating to this indicator for the category to which the data flow belongs. This reward function for the time classifier can be determined as R(s, a_CT, a_CF) = Σ_{p=1}^{P} r_p(s_p, a_CT, u_p), where r_p(s_p, a_CT, u_p) is the reward perceived by the frequency classifier when the scheduling rule u_p is used for the category p of state s_p, given the selection of the hierarchy rule a_CT. Thus, the reward perceived by the time classifier corresponds to the sum of all the rewards of the frequency classifier. At the level of the frequency classifier, the reward function can be determined as r_p(s_p, a_CT, u_p) = Σ_{i_p=1}^{I_p} Σ_{o=1}^{N_Op} (1 − |x̄_{p,o} − x_{p,i_p,o}|), namely as the sum of the rewards for each user i_p ∈ UE_p and for each KPI indicator o ∈ O_p.

According to the invention, reinforcement learning can achieve a compromise between exploitation and exploration, where exploitation consists in repeating the actions which, according to the experience acquired, will maximize the cumulative reward, and where exploration consists in browsing the (state, action) pairs in search of a greater cumulative reward, at the risk of sometimes adopting suboptimal behaviour. More specifically, reinforcement learning retains a compromise between exploration and exploitation which consists in following the current optimal policy most of the time, while choosing a random action more or less regularly. Thus, reinforcement learning according to the invention comprises, at each of the iterations:
- according to a probability P_a, an exploitation selection consisting in selecting the resource allocation plan (the hierarchy rule and/or the first scheduling rule in the preferred embodiment), defined by the action a, which maximizes the sum of the rewards at this stage of the iterations; and
- according to a probability P_a*, an exploration selection consisting in selecting a potentially sub-optimal resource allocation plan (the hierarchy rule and/or the first scheduling rule in the preferred embodiment), defined by an action a* ≠ a.

In a first possible embodiment, the mechanism called ε-greedy is adopted, according to which P_a = 1 − ε, ε being positive and strictly less than 1, and P_a* = ε/K, where K is equal to N or N_R depending on whether the hierarchy rule or the scheduling rule is considered. Alternative solutions to this mechanism exist (for example Boltzmann exploration), in which, during an iteration, each action is associated with a probability of being chosen that depends on the rewards perceived in the previous iterations when this action was selected in state s. Taking the example where the two classifiers use a correspondence table optimized via reinforcement learning, the implementation of the exploration/exploitation compromise is illustrated in Figure 2.
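A minimal sketch of this reward and of the ε-greedy choice is given below, assuming the Category/Flow structures sketched earlier and KPIs normalized to [0, 1] so that the term 1 − |x̄ − x| stays meaningful; the function names and the Q-table layout are illustrative assumptions.

```python
import random

def category_reward(cat):
    """r_p = Σ_{i_p} Σ_{o} (1 - |x̄_{p,o} - x_{p,i_p,o}|), assuming each KPI and its
    QoS threshold are normalized to [0, 1]."""
    return sum(1.0 - abs(threshold - flow.kpi.get(name, 0.0))
               for flow in cat.flows
               for name, threshold in cat.qos_targets.items())

def global_reward(categories):
    """R = Σ_p r_p: the reward perceived by the time classifier."""
    return sum(category_reward(cat) for cat in categories)

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """With probability 1-ε, exploit the action with the best learned value;
    with probability ε, explore by drawing uniformly among the K actions."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```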
In this figure, the state of the system at instant t, s[t], is provided to a first reinforcement learning module 1 responsible for determining the optimal policy in terms of hierarchy of the categories of data flows. A first exploitation/exploration compromise module 2 selects, according to the ε-greedy mechanism, with a probability 1−ε, the action a_CT ∈ A_CT determined by the first reinforcement learning module 1 on the basis of the policy learned at this stage of the iterations and, with a probability ε, a random action a*_CT. The time classifier CT then successively supplies the different categories of data flows to the frequency classifier according to the selected hierarchy rule (i.e. according to the selected action a_CT or a*_CT).

An iterative process then starts, which ends when all the categories of data flows have been served or when all the resources have been allocated. If, according to the selected action a_CT or a*_CT, the category p must be scheduled at the n-th iteration of this process, the state s_p of the users belonging to this category p is provided to a second reinforcement learning module 3 responsible for determining the optimal policy in terms of user scheduling. A second exploitation/exploration compromise module 4 selects, with a probability 1−ε, the action a_CF = u_p determined by the second reinforcement learning module 3 on the basis of the policy learned at this stage of the iterations and, with a probability ε, a random action u*_p. As for the first module 2, mechanisms other than ε-greedy (such as Boltzmann exploration) can be selected to optimize the compromise between exploration and exploitation. Next, a reward r_p(s_p, a_CT or a*_CT, u_p or u*_p) is associated with the selected action u_p or u*_p. The set of actions u_p or u*_p implemented for each of the P categories is noted a_CF.

When the iterative process ends (when all the categories of data flows have been served or when all of the resources have been allocated), an overall reward R(s, a_CT or a*_CT, a_CF), which depends on the rewards perceived for each category, is associated with the action a_CT or a*_CT implemented by the time classifier CT. This global reward is used to update the reinforcement learning modules 1 and 3. This update typically exploits the difference between the expected reward, which is based on past experience, and the new reward. When this difference becomes small enough (e.g. less than 10^-3), the learning process can be interrupted and the learned policy can be stored in the correspondence table LUT. The classifiers CT and CF can then simply consult this table to select the optimal hierarchy and scheduling rules for a given current state of the telecommunications system. It will be noted that, in a possible embodiment described below, the function linking the state to the action is approximated by an approximation function, and the coefficients of the approximation function are then stored in the correspondence table instead of the optimal policy.
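A deliberately simplified sketch of this learning loop is shown below, assuming a single tabular learner (the embodiment of Figure 2 uses two coupled learners, one per classifier, which would follow the same pattern) and reusing the `epsilon_greedy` helper sketched earlier. The callable `step` is a hypothetical stand-in for one simulated TTI: it applies a plan (allocation, transmission, KPI measurement) and returns the next state and the global reward R.

```python
def learn_policy(states, plans, step, epsilon=0.1, gamma=0.9, alpha=0.1, tol=1e-3):
    """Learn a state -> plan correspondence table by tabular Q-learning."""
    q = {}                                   # Q-values, keyed by (state, plan)
    s = states[0]
    while True:
        max_delta = 0.0
        for _ in range(1000):                # one batch of simulated TTIs
            plan = epsilon_greedy(q, s, plans, epsilon)      # exploitation/exploration
            s_next, reward = step(s, plan)                   # allocate, transmit, measure
            best_next = max(q.get((s_next, p), 0.0) for p in plans)
            # difference between the expected reward and the new reward
            delta = alpha * (reward + gamma * best_next - q.get((s, plan), 0.0))
            q[(s, plan)] = q.get((s, plan), 0.0) + delta
            max_delta = max(max_delta, abs(delta))
            s = s_next
        if max_delta < tol:                  # updates small enough: stop learning
            break
    # freeze the greedy policy into the correspondence table (LUT)
    return {st: max(plans, key=lambda p: q.get((st, p), 0.0)) for st in states}
```

At run time the scheduler then simply reads `lut[current_state]` at each TTI, as described above for the classifiers CT and CF.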
Different variant embodiments are described below.

System state representation

In order to reduce the complexity of the solution proposed by the invention and to speed up the learning process, different variant embodiments can be implemented, independently or jointly, to simplify the representation of the state of the system. As mentioned earlier, this state is made up of controllable and uncontrollable elements. These variant embodiments specifically aim to represent the state effectively according to this classification.

Controllable state space compression

According to this variant, the variables of the controllable state s_p^c (for example delay, rate and packet loss rate) of each of the categories of data streams are the subject of a statistical treatment, for example to calculate their mean and variance. In this way, the controllable state no longer depends on the number of users I_p in each of the categories of data flows. The dimension of the state is thus reduced from Σ_{p∈T_c} I_p·N_Op to Σ_{p∈T_c} 2·N_Op.

Uncontrollable state space compression

According to this variant, the variables of the uncontrollable state s_p^u of each of the categories are subject to a statistical treatment so as to reduce the size of this state. For certain parameters, such as the CQI indicator, it is possible to eliminate the dependence on the number of users I_p in the category and on the bandwidth (i.e. the number N_RB of available resource units). On the contrary, for other parameters, such as the characteristics of the data flow, it is only possible to eliminate the dependence on the number of users. Clustering or classification algorithms can be used for this purpose. Among the clustering algorithms, one can cite the Lloyd or Swap heuristics. Among the classification algorithms, one can cite support vector machines (SVM) or Radial Basis Function based Neural Networks (RBFNN).

Reward

In the foregoing, an example of the calculation of a reward function has been presented. Other functions can however be calculated, all depending on the KPIs and QoS constraints. The reward allocated to the time classifier can for example be normalized by the number of categories of data flows, while the reward attributed to the frequency classifier can for example be normalized by the number of users and the number of KPIs in the category considered.

Reinforcement learning techniques

Many reinforcement learning techniques have been proposed in the literature and can be used to create the correspondence table. In general, the objective of reinforcement learning is to identify the policy π which minimizes (respectively maximizes) the expected value of a certain function J(π) representative of a cost (respectively of a reward) associated with the implementation of the policy π. This function J(π) can represent a sum of discounted rewards: J(π) = E[Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0, π], where s_0 is the initial state, r_{t+1} is the reward perceived when the policy π is implemented starting from the state s_0, and γ ∈ [0; 1] is a discount rate that determines the current value of future rewards. The function J(π) can also be represented as an average reward according to: J(π) = lim_{n→∞} E[(1/n) · Σ_{t=0}^{n-1} r_{t+1}].

In addition, reinforcement learning algorithms can be divided into three groups: pure critic, pure actor and actor-critic.

"Pure critic" methods

These methods learn the optimal strategy π using a function Q^π(s, a) which evaluates the value of each state-action pair (s, a), when there is no explicit function to represent and evaluate the policy being learned. When the objective is to maximize a discounted sum of rewards, Q^π(s, a) can be defined according to: Q^π(s, a) = E[Σ_{t=0}^{∞} γ^t · r(s_t, a_t) | s_0 = s, a_0 = a, π].
According to Bellman's optimality equation, the optimal value Q*(s, a) can be calculated recursively as follows: Q*(s, a) = E[r(s, a) + γ · max_{a'∈A} Q*(s', a') | s, a], where s' is the next state observed when action a is applied in state s. After convergence, it is possible to determine the optimal action a* for the current state s, i.e. the action for which Q*(s, a) is maximum: a* = argmax_{a∈A} Q*(s, a). In the end, the optimal state-action pairs are stored in the correspondence table.

"Pure actor" methods

These methods work with a family of parametric policies and aim to optimize the function J(π) directly over the parameter space of the policy. A performance gradient with respect to the actor parameters is directly estimated by simulation, and the parameters are updated in a direction of improvement. Pure actor methods have the advantage over pure critic methods of allowing the policy to generate actions in a completely continuous action space. Let us consider a parametrization of the policy π by the vector of parameters θ. J is then a function of the parametric policy π_θ, and hence a function of θ. The gradient of J with respect to θ is described by: ∂J/∂θ = (∂J/∂π_θ) · (∂π_θ/∂θ). Then, using conventional optimization techniques (such as gradient descent), a locally optimal solution of the function J can be identified. A disadvantage of the pure actor approach is that the estimated gradient can have a large variance, which increases the time required for learning.

Actor-critic methods

These methods aim to combine the advantages of pure actor and pure critic methods. As with pure actor methods, these methods are capable of prescribing continuous actions, while the large variance of pure actor methods is counteracted by adding a critic whose role is to assess the current policy prescribed by the actor.

Approximation functions for reinforcement learning

The state-action space can prove to be combinatorial and immense. The problem is then not only the amount of memory necessary for the correspondence table, but also the time and the data necessary to carry out the learning of the optimal policy.

Approximation by a linear function

It is possible to use an approximate function f(·, θ) in the form of a linear function of a weight vector θ. To each state s then corresponds a feature vector φ(s) = (φ_1(s), ..., φ_n(s)) having the same number of components as θ. There are many ways to build features from states, for example Fourier basis functions or tile coding. In general, the approximate function is given by the dot product between θ and φ(s): f(·, θ) = θ^T · φ(s) = Σ_{i=1}^{n} θ_i · φ_i(s). The individual functions φ_i are called basis functions because they form a linear basis for the set of linear functions of this form. Constructing a feature vector of n dimensions to represent states is equivalent to selecting a set of n basis functions. It is natural to use a stochastic gradient descent with a linear function approximation to update θ iteratively until convergence (a minimal sketch of such an update is given at the end of this description).

Approximation by means of a fuzzy inference system

To effectively represent the state of a system, a well-known solution consists in integrating a fuzzy inference system (FIS) into the Q-learning algorithm. The FIS system is based on fuzzy logic in which, unlike standard logic, the truth of any statement is a matter of degree, i.e. is noted by a value between 0 and 1.
Furthermore, the FIS system uses linguistic variables, that is to say variables whose values are words, such as "high" or "low". Each linguistic variable x_i is associated with a set of terms T(x_i) which includes all the fuzzy sets corresponding to the linguistic values of x_i. The learning then comprises three phases: fuzzification, calculation of truth values and defuzzification.

Approximation by a neural network

Neural networks are a powerful tool for estimating the optimal function f(·, θ). As for the linear approximation approach, during the learning phase, at iteration i, the new weights θ_i are calculated so that the root mean square error in the Bellman equation is reduced. However, in this case, the approximate function is calculated as a non-linear function of the weights θ. More particularly, if we consider a predictive neural network with three levels, with a hidden level composed of D nodes and an input level composed of M nodes, the output k can be calculated as follows: y_k(x, θ) = σ( Σ_{j=1}^{D} θ_{kj}^{(2)} · h( Σ_{i=1}^{M} θ_{ji}^{(1)} · x_i + θ_{j0}^{(1)} ) + θ_{k0}^{(2)} ), where x corresponds to the inputs of the neural network, θ is the set of weights of the neural network, h is the activation function of the hidden nodes (a differentiable nonlinear sigmoid function such as the logistic sigmoid or the hyperbolic tangent), and σ is the activation function of the output nodes (the identity function is used here).

The invention thus relates to a method for determining a resource allocation which implements the reinforcement learning described above to determine the optimal resource allocation plan for a given state of the telecommunications system. It also relates to the exploitation of this learning, and thus to a packet scheduling method comprising, at each transmission time interval, the selection of a resource allocation plan and an allocation of the transmission resources to the data flows in accordance with the selected resource allocation plan. This selection is made by interrogating a correspondence table whose content results from the implementation of reinforcement learning and which makes it possible to identify, from the current state of the multi-access telecommunications system, the resource allocation plan to be selected.

The invention extends to a computer program product comprising program code instructions for executing the scheduling method when said program is executed by a computer, as well as to a node of a multi-access telecommunications system which includes a correspondence table whose content results from the implementation of reinforcement learning and a scheduler configured to, at each transmission interval, interrogate the correspondence table to identify, from the state of the multi-access telecommunications system, a resource allocation plan, and to allocate transmission resources to the data streams in accordance with the identified resource allocation plan.
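As announced in the description of the linear-approximation variant, a minimal sketch of such an update is given below: one stochastic-gradient step on the Bellman error with f(·, θ) = θᵀ·φ. The feature map φ(s, a) is assumed to be supplied by the caller (for example a Fourier basis or tile coding), and the function names and learning rates are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def q_value(theta, phi):
    """Linear approximation of the value: f(·, θ) = θᵀ · φ."""
    return float(np.dot(theta, phi))

def sgd_update(theta, phi, reward, phi_next_best, gamma=0.9, alpha=0.01):
    """One stochastic-gradient step reducing the squared Bellman error.
    phi: features of the current (state, action) pair; phi_next_best: features
    of the greedy (next state, action) pair."""
    target = reward + gamma * q_value(theta, phi_next_best)
    td_error = target - q_value(theta, phi)
    return theta + alpha * td_error * phi    # the gradient of θᵀφ with respect to θ is φ
```

In that variant, the correspondence table stores the weight vector θ rather than an explicit state-to-plan mapping, as noted in the description of Figure 2.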
Claims:
1. Method for determining an allocation of resources for packets belonging to a plurality of categories of data flows in a multi-access telecommunications system sharing a plurality of transmission resources, characterized in that it comprises, for a state of the multi-access telecommunications system, the determination, by iterations of reinforcement learning, of a resource allocation plan which maximizes a sum of rewards, each iteration of reinforcement learning comprising:
- the allocation of transmission resources to data flows according to a resource allocation plan;
- the transmission of packets by means of the allocated transmission resources;
- for each of the categories of data flows, the calculation of at least one transmission performance indicator for each of the data flows of the category, and the comparison, for each of the data flows of the category, of the at least one calculated transmission performance indicator with a threshold representative of a quality of service constraint relating to the at least one transmission performance indicator for the category, and
- the determination of a reward according to the result of said comparison for each of the data flows of each of the categories.

2. Method according to claim 1, in which the resource allocation plan comprises a hierarchy rule and at least a first and a second scheduling rule, the allocation of transmission resources comprising two steps consisting of:
- a scheduling in the time domain performed by a time classifier (CT) configured to prioritize the categories of data flows, according to the hierarchy rule, into at least one priority category and one secondary category; and
- a scheduling in the frequency domain carried out by a frequency classifier (CF) configured to:
o schedule the data streams of the priority category according to the first scheduling rule and allocate transmission resources to the data streams of the priority category thus scheduled; and
o in the case of remaining resources, schedule the data streams of the secondary category according to the second scheduling rule and allocate remaining transmission resources to the data streams of the secondary category thus scheduled.

3. Method according to claim 2, in which the reward determined at each iteration is the sum Σ_{p=1}^{P} r_p(a_CT, u_p) with r_p(a_CT, u_p) = Σ_{i_p=1}^{I_p} Σ_{o=1}^{N_Op} (1 − |x̄_{p,o} − x_{p,i_p,o}|), where P denotes the number of categories, a_CT the hierarchy rule, u_p the scheduling rule applied to the data flows of category p, I_p the number of data flows of category p, N_Op the number of transmission performance indicators of category p, x̄_{p,o} the threshold representative of a quality of service constraint relating to the transmission performance indicator o for category p, and x_{p,i_p,o} the transmission performance indicator o of the data flow i_p.

4. Method according to one of claims 1 to 3, wherein each iteration of reinforcement learning comprises:
- according to a probability P_a, an exploitation selection consisting in selecting the resource allocation plan which maximizes the sum of the rewards at this stage of the iterations; and
- according to a probability P_a*, an exploration selection consisting in selecting a resource allocation plan different from the resource allocation plan which maximizes the sum of the rewards at this stage of the iterations.
5. Method according to one of claims 1 to 4, wherein the at least one transmission performance indicator is one of a transmission delay, a transmission rate and a transmission loss rate.

6. Method for scheduling packets belonging to a plurality of categories of data streams in a multi-access telecommunications system sharing a plurality of transmission resources, characterized in that it comprises, at each transmission time interval, the selection of a resource allocation plan and an allocation of transmission resources to the data streams in accordance with the selected resource allocation plan, said selection being carried out by interrogation of a correspondence table (LUT) whose content results from the implementation of the method according to one of claims 1 to 5 and which makes it possible to identify, from the current state of the multi-access telecommunications system (s[t]), the resource allocation plan to be selected.

7. A computer program product comprising program code instructions for executing the method of claim 6 when said program is executed by a computer.

8. Node of a multi-access telecommunications system sharing a plurality of transmission resources between packets belonging to a plurality of categories of data flows, comprising a correspondence table (LUT) whose content results from the implementation of the method according to one of claims 1 to 5, and a scheduler configured to, at each transmission interval, interrogate the correspondence table to identify, from the current state of the multi-access telecommunications system (s[t]), a resource allocation plan, and to allocate transmission resources to the data streams in accordance with the identified resource allocation plan.
Sulfonates, polymers, resist compositions and patterning process
Washing machine
Washing machine
Device for fixture finishing and tension adjusting of membrane
Structure for Equipping Band in a Plane Cathode Ray Tube
Process for preparation of 7 alpha-carboxyl 9, 11-epoxy steroids and intermediates useful therein an
国家/地区
|